Univariate Section

Plots

## [1] 4898   12
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

The strongest white wine is 14.2% alcohol and the weakest is 8%. All of the white wines are acidic, with pH ranging from 2.72 to 3.82. On a scale from 0 to 10, the median quality is 6 and the mean is 5.9. The highest-rated wine scored 9 and the lowest scored 3.

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

The wine ratings are whole numbers ranging from 3 to 9. The distribution appears roughly normal, with most wines rated 5, 6 or 7.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The alcohol level in the dataset appears to be somewhat positively skewed. Taking the log2 of the value makes it more normally distributed; however, the mode still appears to be lower than the median.
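The effect of the log2 transform on skew can be checked numerically. This is a minimal sketch on a small made-up vector rather than the wine data, and `skewness` is a hypothetical helper defined here, not a base R function.

```r
# Toy right-skewed sample standing in for wine$alcohol (values are made up).
alcohol <- c(8.5, 9.0, 9.2, 9.4, 9.5, 9.8, 10.0, 10.4, 11.0, 12.5, 14.0)

# Hypothetical helper: moment-based sample skewness (0 for a symmetric sample).
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

skew_raw <- skewness(alcohol)
skew_log <- skewness(log2(alcohol))

# A smaller positive value after the transform means a less skewed shape.
c(raw = skew_raw, log2 = skew_log)
```

On the real data the same two calls would use `wine$alcohol` in place of the toy vector.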

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH appears normally distributed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density has a couple of outliers; when they are removed, the distribution looks more normal.

Volatile acidity (acetic acid) in large quantities can lead to an unpleasant vinegar taste in wine. I wonder whether wine quality will correlate with this?

This distribution may be bimodal, with two peaks, one around 1 and the other around 3. Perhaps these correspond to different kinds of wine, some fruitier and others dry?

Sulphates can contribute to the level of sulphur dioxide in wine, so I would assume sulphate levels are correlated with sulphur dioxide gas levels? In high concentrations sulphur dioxide can be detectable in wine, and I wonder whether this will affect the quality? I worked out the amount of bound sulphur dioxide by subtracting free from total.

Chlorides indicate the amount of salt in the wine. I assume that at larger concentrations it would affect the taste and quality.

Analysis

What is the structure of your dataset?

There are 4898 observations of 12 variables in the dataset. Quality scores are integers ranging from 3 to 9 and are roughly normally distributed.

Other observations: the median alcohol content is 10.40 and the median quality is 6. Most wines in the dataset are dry, with a median residual sugar value of 5.2.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest to me is the quality score. I want to see if we can estimate the quality of a wine based on its properties.

What other features in the dataset do you think will support your investigation into your feature(s) of interest?

I think some of the features which may affect the quality of wine are volatile.acidity (high levels can make wine taste like vinegar), free.sulfur.dioxide (high levels can be detected by taste/nose) and the ratio of acid to sweet (see below).

Did you create any new variables from existing variables in your dataset?

I found an article on Wikipedia about wine tasting, http://en.wikipedia.org/wiki/Acids_in_wine#In_wine_tasting, which says that an important factor in the quality of wine is the balance of acidity vs. sweetness. So I also calculated an acid-to-sweet ratio.

wine$acid.sweet.ratio <- (wine$fixed.acidity 
                          + wine$volatile.acidity) / wine$residual.sugar
summary(log(wine$acid.sweet.ratio))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.0160 -0.2992  0.2803  0.4769  1.4040  2.7010

I also added the level of bound sulphur dioxide, calculated as total minus free.

wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

Of the features you investigated were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so why?

I found this article, so I decided to group the wines by the sweetness levels defined by the EU: http://en.wikipedia.org/wiki/Sweetness_of_wine#Residual_sugar

# Add sweetness groupings based on the EU residual-sugar bands.
# fixed.acidity is used as the acid measure in the two relaxed rules.
wine$type <- ''
wine$type[wine$residual.sugar > 45] <- 'Sweet'
# <= 45 so that a wine at exactly 45 is not left unclassified.
wine$type[wine$residual.sugar <= 45
          & wine$residual.sugar > 18] <- 'Medium'
# Below 9 counts as dry when sugar exceeds acidity by less than 2.
wine$type[wine$residual.sugar < 9
          & (wine$residual.sugar - wine$fixed.acidity < 2)] <- 'Dry'
# Between 9 and 18 counts as medium dry when sugar exceeds acidity by less than 10.
wine$type[(wine$residual.sugar > 9
           & wine$residual.sugar < 18)
          & (wine$residual.sugar - wine$fixed.acidity < 10)] <- 'Medium Dry'
# Fallback rules for rows not matched above.
wine$type[(wine$residual.sugar < 4) & wine$type == ''] <- 'Dry'
wine$type[(wine$residual.sugar > 12
           & wine$residual.sugar <= 45) & wine$type == ''] <- 'Medium'
wine$type[(wine$residual.sugar > 4
           & wine$residual.sugar < 12) & wine$type == ''] <- 'Medium Dry'
wine$type <- ordered(wine$type, levels = c("Dry", "Medium Dry", "Medium", "Sweet"))

##        Dry Medium Dry     Medium      Sweet 
##       3442       1290        165          1

Judging by the sample, it looks like this variety of grape is used to produce mainly dry or medium dry wines. There is only a small proportion of medium and sweet wines in the sample.
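One way to confirm that the grouping rules covered every wine is to count missing values after the conversion to an ordered factor, since any label not listed in `levels` (including the empty string) becomes NA. A minimal sketch on a made-up vector, not the wine data:

```r
# Made-up labels; the empty string stands for a row that no rule matched.
type <- c("Dry", "", "Medium", "Dry", "Medium Dry")

type_f <- ordered(type, levels = c("Dry", "Medium Dry", "Medium", "Sweet"))

# useNA = "ifany" adds an NA column, which table() hides by default.
counts <- table(type_f, useNA = "ifany")
counts

# Number of rows left unclassified.
unclassified <- sum(is.na(type_f))
unclassified
```

In the counts reported above the four groups add up to 4898, so no wine in this dataset appears to have been left unclassified.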

Bivariate Section

Bivariate Plots

There are some obvious correlations that I don't find very interesting, such as citric acid with fixed acidity, free sulphur dioxide with bound sulphur dioxide, and fixed acidity with pH.

Quality seems to be affected to some extent by, in decreasing order of influence, alcohol level (0.436 correlation), density (-0.307), chlorides (-0.21), bound.sulfur.dioxide (-0.21) and volatile acidity (-0.195). I would like to examine these more.
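A ranking like this can be produced by correlating each feature with quality and sorting by absolute value. The sketch below uses a small made-up data frame whose column names mirror the dataset; the values, and therefore the resulting numbers, are illustrative only.

```r
# Toy data frame; the column names mirror the dataset but the values are made up.
toy <- data.frame(
  quality          = c(5, 6, 6, 7, 5, 8, 4, 6),
  alcohol          = c(9.5, 10.4, 11.0, 12.2, 9.2, 12.8, 9.0, 10.9),
  density          = c(0.997, 0.994, 0.993, 0.991, 0.998, 0.990, 0.999, 0.994),
  volatile.acidity = c(0.30, 0.26, 0.28, 0.22, 0.35, 0.20, 0.45, 0.27)
)

# Pearson correlation of every other column with quality.
features <- setdiff(names(toy), "quality")
cors <- sapply(toy[features], function(col) cor(col, toy$quality))

# Rank by absolute strength so positive and negative effects are comparable.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
round(ranked, 3)
```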

There appears to be a general increase in the quality of wine as alcohol level increases, as shown by the positive correlation (0.436). There appears to be more variance at higher and lower concentrations. The strips in the plot are caused by the integer quality scores.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20
##   (7,8]   (8,9]  (9,10] (10,11] (11,12] (12,13] (13,14] (14,15] 
##       2     500    1583    1252     850     609     100       2
## Source: local data frame [8 x 6]
## 
##   alcohol.bucket lower median upper     mean    n
## 1          (7,8]   3.5      4   4.5 4.000000    2
## 2          (8,9]   5.0      5   6.0 5.606000  500
## 3         (9,10]   5.0      5   6.0 5.487682 1583
## 4        (10,11]   5.0      6   6.0 5.864217 1252
## 5        (11,12]   6.0      6   7.0 6.190588  850
## 6        (12,13]   6.0      7   7.0 6.571429  609
## 7        (13,14]   6.0      7   7.0 6.720000  100
## 8        (14,15]   7.0      7   7.0 7.000000    2

Here I have split the alcohol levels into buckets so that we can look at the data in a slightly different way. You can see an evident rise in the median quality score as alcohol level increases.
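The bucketing can be sketched with `cut()`, which yields the same right-closed labels such as `(8,9]` that appear in the table above. The summary shown earlier looks like it came from a data-frame summarising package; this base R version on made-up values is only an approximation of that step.

```r
# Made-up alcohol and quality values standing in for the wine data.
toy <- data.frame(
  alcohol = c(8.4, 8.9, 9.3, 9.8, 10.2, 10.9, 11.5, 11.9, 12.4, 12.8),
  quality = c(5,   5,   5,   6,   6,    5,    6,    7,    7,    8)
)

# 1% wide, right-closed buckets: (7,8], (8,9], ..., (14,15].
toy$alcohol.bucket <- cut(toy$alcohol, breaks = 7:15)

# Drop buckets that contain no rows, then summarise quality per bucket.
toy$alcohol.bucket <- droplevels(toy$alcohol.bucket)
bucket_median <- tapply(toy$quality, toy$alcohol.bucket, median)
bucket_n      <- table(toy$alcohol.bucket)

data.frame(median = as.vector(bucket_median),
           n      = as.vector(bucket_n),
           row.names = names(bucket_median))
```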

Here is a plot with the outliers (lower and upper 1%) removed and a trend line added. It shows the slight negative correlation between density and quality (-0.307).
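Trimming the lower and upper 1% can be done with `quantile()` before plotting or fitting. The sketch below uses simulated values with two planted outliers; the straight-line `lm()` fit is an assumption standing in for whatever smoother the plot used.

```r
set.seed(1)  # made-up data, fixed seed for reproducibility

# 198 typical density values plus one very high and one very low outlier.
toy <- data.frame(density = c(rnorm(198, mean = 0.994, sd = 0.003), 1.039, 0.960))
toy$quality <- round(6 - 100 * (toy$density - 0.994) + rnorm(200, sd = 0.8))

# Keep only rows strictly inside the 1st to 99th percentile of density.
limits  <- quantile(toy$density, probs = c(0.01, 0.99))
trimmed <- subset(toy, density > limits[1] & density < limits[2])

# Straight-line trend on the trimmed data.
trend <- lm(quality ~ density, data = trimmed)
coef(trend)
```

For an asymmetric trim, such as the bottom 1% and top 4% used for chlorides, only the `probs` argument would change.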

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Half of the wines have chloride levels between 0.036 and 0.050 (the first and third quartiles). There are several outliers across varying qualities of wine. In my later plots I removed the bottom 1% and top 4% of values. This shows that there is a small negative correlation between chlorides and quality (-0.21).

A high concentration of bound sulphur dioxide appears to affect the quality of the wine. The plot shows the slight negative correlation (-0.21), although there are few data points at larger concentrations. The latter plot has the upper and lower 1% of points removed.

A high volatile.acidity concentration appears to affect the quality of the wine. There is a slight negative correlation (-0.195), although there are few data points at larger concentrations. The latter plot has the upper and lower 1% of points removed.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

There is a strong negative correlation between alcohol and density (-0.78).
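Output of this kind comes from `cor.test()`, which defaults to Pearson's method. A minimal sketch on made-up values (the real call would pass `wine$density` and `wine$alcohol`):

```r
# Made-up values: density falls as alcohol rises.
density <- c(0.9990, 0.9975, 0.9962, 0.9950, 0.9941, 0.9930, 0.9921, 0.9908)
alcohol <- c(8.8, 9.2, 9.6, 10.1, 10.6, 11.3, 11.9, 12.7)

result <- cor.test(density, alcohol)  # Pearson by default

result$estimate   # correlation coefficient
result$conf.int   # 95 percent confidence interval
result$p.value    # test of "true correlation is equal to 0"
```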

Bivariate Analysis

Talk about some of the relationships you observed in the investigation. How did the feature of interest vary with other features in the dataset?

The strongest correlation with quality was alcohol content. As alcohol content goes up, quality tends to go up (correlation 0.436). You can see this more clearly in the box plot that I produced, which groups data by alcohol content in 1% increments. The mean quality of the groups rises as alcohol level increases, apart from a small dip between the (8,9] and (9,10] buckets.

Density is also negatively correlated with the quality of wine, which makes sense because alcohol and density are strongly negatively correlated and alcohol and quality are positively correlated.

The other variables I looked at showed some correlation. High concentrations of bound sulphur dioxide, chlorides and volatile acidity tend to adversely affect wine quality, and each has a small negative correlation with quality.

I would have expected the acid to sweetness ratio to correlate more strongly with quality, but it showed a very small negative correlation (-0.015). I would still like to explore this further in the multivariate section, since research has indicated that this balance is important in wine quality.

Did you observe any other interesting relationships between the other features (not the main feature of interest)?

Residual sugar appears to be negatively correlated with alcohol level. I assume this is because sweeter wines tend to be less strong?

What was the strongest relationship you found?

The strongest correlation was between residual sugar and density (0.839). I assume this is because sugar is denser than water, so as its concentration increases the density increases.

Multivariate Section

Multivariate Plots

Here is another plot matrix. This time I have used the type of wine to differentiate the points, to see if this affects any of the relationships.

Let's convert quality to a factor, since the scores are a fixed set of integers.

Now that I've converted quality to a factor, it is probably worth looking at the variable matrix again.
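One thing to watch after this conversion is that `as.numeric()` on a factor returns the internal level codes rather than the original scores. The models later in this section call `as.numeric(quality)`; if quality is a factor with levels 3 to 9 at that point, the response would run from 1 to 7, i.e. quality minus 2. A small sketch with made-up scores:

```r
quality   <- c(6, 5, 7, 3, 9, 6)   # made-up scores
quality_f <- factor(quality)       # levels: "3" "5" "6" "7" "9"

# Level codes (position of each value among the sorted levels).
codes <- as.numeric(quality_f)

# Original scores, recovered by going through the character labels.
scores <- as.numeric(as.character(quality_f))

rbind(codes, scores)
```

Correlations are unaffected by a constant shift of this kind, but intercepts and predicted values are.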

There are a couple of interesting distributions that echo my previous analysis. The density plot of volatile.acidity shows that the quality of wines at higher values is affected (the hump around 0.6). The boxplot of alcohol against quality shows the positive correlation. The density plot of chlorides again shows the effect of high concentrations on quality: all the little humps after 1.2 are for generally lower quality wine. Finally, the bound sulfur dioxide density shows that the good wines are generally centred around ~75, whereas the others have a wider variance.

ggplot(aes(x=volatile.acidity, color=quality), data=wine) +
  geom_density()

ggplot(aes(x=quality, y=alcohol, fill=quality), data=wine) +
  geom_boxplot()

ggplot(aes(x=chlorides, color=quality), data=wine) +
  geom_density()

ggplot(aes(x=bound.sulfur.dioxide, color=quality), data=wine) +
  geom_density()

I’ll look at a few more variables, again focussing on alcohol content, chlorides, sulphur.dioxide and volatile.acidity to further explore the data.

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(wine$quality) and log(wine$chlorides + 1)
## t = -15.476, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2425001 -0.1890976
## sample estimates:
##        cor 
## -0.2159603

These plots suffer from a lot of overplotting, so let's now ignore the average wines (qualities 5-7), which contain the most points, and look at the very good and very bad wines to see if I can explain the outliers.

Let's have a look at the wines that have a low quality score and a high level of alcohol. The general trend is for quality to increase as alcohol increases, so these outliers are interesting. Let's look at the factors that I have previously found to affect wine quality.
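The two groups can be selected with logical indexing. In this sketch the rows are made up, and the 12% cut-off for "high alcohol" is an assumption suggested by the minimum alcohol value in the group summaries, not a threshold stated in the text.

```r
# Made-up rows; quality is a factor, as in this part of the analysis.
toy <- data.frame(
  quality = factor(c(3, 4, 5, 6, 7, 8, 9, 4)),
  alcohol = c(12.4, 12.1, 9.8, 10.5, 11.2, 12.6, 12.9, 9.1)
)

# Recover the numeric score from the factor labels before comparing.
score <- as.numeric(as.character(toy$quality))

strong    <- toy$alcohol >= 12            # assumed cut-off for "high alcohol"
very_bad  <- toy[score <= 4 & strong, ]   # low score despite high alcohol
very_good <- toy[score >= 8 & strong, ]   # high score and high alcohol

nrow(very_bad)
nrow(very_good)
```

Calling `summary()` on a column of each subset, for example `summary(very_bad$alcohol)`, gives side-by-side comparisons like the ones that follow.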

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       6      36      56      58      74     128
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   42.00   68.75   80.00   83.76   97.25  144.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.013   0.024   0.031   0.029   0.033   0.044
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.02875 0.03300 0.03357 0.03800 0.06000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2400  0.3150  0.3300  0.4473  0.5150  1.1000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.2600  0.3200  0.3328  0.4100  0.6600

Between these groups there is a big difference in the bound sulphur dioxide. Very good wines range from 42 to 144, and very bad wines from 6 to 128. I'll try plotting the very low sulphur dioxide values (less than 40 on the plot).

That explains some of the outliers, but let's look at some of the others. I'll remove the points with low bound sulphur dioxide and try showing the level of volatile.acidity on the same plot using the size of the point.

This explains another group of the outliers with a high alcohol level: they have too much volatile.acidity. Finally, let's remove the points above 0.63.

##  fixed.acidity  volatile.acidity  citric.acid     residual.sugar 
##  Min.   :6.00   Min.   :0.2400   Min.   :0.3100   Min.   :0.900  
##  1st Qu.:6.35   1st Qu.:0.2825   1st Qu.:0.3825   1st Qu.:1.312  
##  Median :6.85   Median :0.3200   Median :0.4400   Median :2.100  
##  Mean   :6.95   Mean   :0.3050   Mean   :0.4183   Mean   :3.225  
##  3rd Qu.:7.35   3rd Qu.:0.3275   3rd Qu.:0.4675   3rd Qu.:4.425  
##  Max.   :8.30   Max.   :0.3500   Max.   :0.4800   Max.   :8.000  
##                                                                  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01400   Min.   : 6.00       Min.   : 61.00      
##  1st Qu.:0.02725   1st Qu.:10.50       1st Qu.: 65.00      
##  Median :0.03200   Median :16.50       Median : 76.50      
##  Mean   :0.03017   Mean   :14.17       Mean   : 86.33      
##  3rd Qu.:0.03300   3rd Qu.:18.00       3rd Qu.: 94.00      
##  Max.   :0.04400   Max.   :19.00       Max.   :143.00      
##                                                            
##     density             pH          sulphates         alcohol      quality
##  Min.   :0.9893   Min.   :3.030   Min.   :0.3200   Min.   :12.00   3:1    
##  1st Qu.:0.9900   1st Qu.:3.072   1st Qu.:0.3425   1st Qu.:12.12   4:5    
##  Median :0.9910   Median :3.210   Median :0.3800   Median :12.30   5:0    
##  Mean   :0.9906   Mean   :3.170   Mean   :0.4767   Mean   :12.33   6:0    
##  3rd Qu.:0.9911   3rd Qu.:3.235   3rd Qu.:0.5900   3rd Qu.:12.55   7:0    
##  Max.   :0.9914   Max.   :3.300   Max.   :0.7900   Max.   :12.70   8:0    
##                                                                    9:0    
##  acid.sweet.ratio bound.sulfur.dioxide         type   alcohol.bucket
##  Min.   :0.8525   Min.   : 46.00       Dry       :6   (12,13]:5     
##  1st Qu.:1.9048   1st Qu.: 55.25       Medium Dry:0   (11,12]:1     
##  Median :3.4146   Median : 63.00       Medium    :0   (7,8]  :0     
##  Mean   :3.9527   Mean   : 72.17       Sweet     :0   (8,9]  :0     
##  3rd Qu.:6.2000   3rd Qu.: 76.00                      (9,10] :0     
##  Max.   :7.5043   Max.   :128.00                      (10,11]:0     
##                                                       (Other):0
##  fixed.acidity   volatile.acidity  citric.acid     residual.sugar  
##  Min.   :3.900   Min.   :0.1200   Min.   :0.0400   Min.   : 0.800  
##  1st Qu.:6.025   1st Qu.:0.2600   1st Qu.:0.2725   1st Qu.: 2.200  
##  Median :6.600   Median :0.3200   Median :0.3200   Median : 4.200  
##  Mean   :6.473   Mean   :0.3256   Mean   :0.3249   Mean   : 4.733  
##  3rd Qu.:7.000   3rd Qu.:0.3975   3rd Qu.:0.3600   3rd Qu.: 6.100  
##  Max.   :8.000   Max.   :0.5500   Max.   :0.7400   Max.   :14.800  
##                                                                    
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01400   Min.   : 6.00       Min.   : 73.0       
##  1st Qu.:0.02825   1st Qu.:27.25       1st Qu.:104.2       
##  Median :0.03300   Median :34.00       Median :117.0       
##  Mean   :0.03367   Mean   :35.24       Mean   :119.8       
##  3rd Qu.:0.03800   3rd Qu.:39.00       3rd Qu.:135.2       
##  Max.   :0.06000   Max.   :70.00       Max.   :186.0       
##                                                            
##     density             pH          sulphates         alcohol      quality
##  Min.   :0.9871   Min.   :2.940   Min.   :0.2500   Min.   :12.00   3: 0   
##  1st Qu.:0.9895   1st Qu.:3.130   1st Qu.:0.3600   1st Qu.:12.30   4: 0   
##  Median :0.9904   Median :3.225   Median :0.4800   Median :12.55   5: 0   
##  Mean   :0.9907   Mean   :3.228   Mean   :0.4808   Mean   :12.62   6: 0   
##  3rd Qu.:0.9915   3rd Qu.:3.348   3rd Qu.:0.5900   3rd Qu.:12.90   7: 0   
##  Max.   :0.9952   Max.   :3.570   Max.   :0.9400   Max.   :14.00   8:86   
##                                                                    9: 4   
##  acid.sweet.ratio bound.sulfur.dioxide         type    alcohol.bucket
##  Min.   :0.5115   Min.   : 42.00       Dry       :80   (12,13]:73    
##  1st Qu.:1.0681   1st Qu.: 69.00       Medium Dry:10   (13,14]:12    
##  Median :1.7143   Median : 81.00       Medium    : 0   (11,12]: 5    
##  Mean   :2.2045   Mean   : 84.60       Sweet     : 0   (7,8]  : 0    
##  3rd Qu.:2.8102   3rd Qu.: 99.75                       (8,9]  : 0    
##  Max.   :8.5625   Max.   :144.00                       (9,10] : 0    
##                                                        (Other): 0

Here we are left with the filtered datasets, including points that we cannot explain by the features in our dataset. Perhaps there were a couple of other hidden features that we can't see, like price for example? Maybe the poor wines remaining in our set are massively overpriced, and the taster has taken that into account when applying a quality score? Or perhaps there could be other data issues. Perhaps different tasters were used, or perhaps the tasters that scored these wines were inexperienced?

I will try to create a new column in the dataset to classify the wines that have extreme values, to see if we can show graphically how they affect quality.
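A sketch of how such a column might be built follows. The category names match the `cat` levels in the model output below, and the limits of 40 for bound sulphur dioxide and 0.63 for volatile acidity come from the filtering steps above. The chlorides limit and the priority order when a wine meets several conditions are assumptions made for illustration.

```r
# Made-up rows; column names follow the dataset.
toy <- data.frame(
  bound.sulfur.dioxide = c(30, 80, 75, 90, 20),
  volatile.acidity     = c(0.30, 0.70, 0.28, 0.25, 0.90),
  chlorides            = c(0.030, 0.040, 0.090, 0.035, 0.100)
)

chlorides_limit <- 0.06  # hypothetical threshold, not stated in the text

# Later assignments overwrite earlier ones, so LowSulfur takes priority
# (this ordering is also an assumption).
toy$cat <- "Normal"
toy$cat[toy$chlorides > chlorides_limit] <- "HighChlorides"
toy$cat[toy$volatile.acidity > 0.63]     <- "HighAcidic"
toy$cat[toy$bound.sulfur.dioxide < 40]   <- "LowSulfur"

toy$cat <- factor(toy$cat,
                  levels = c("Normal", "HighAcidic", "HighChlorides", "LowSulfur"))
table(toy$cat)
```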

I'd now like to create a model using the features I have identified to see how good it is at estimating a wine's quality.

## 
## Calls:
## m1: lm(formula = I(as.numeric(quality)) ~ I(alcohol), data = wine)
## m2: lm(formula = I(as.numeric(quality)) ~ I(alcohol) + log(volatile.acidity) - 
##     1, data = wine)
## m3: lm(formula = I(as.numeric(quality)) ~ I(alcohol) + log(volatile.acidity) + 
##     log(bound.sulfur.dioxide) - 1, data = wine)
## m4: lm(formula = I(as.numeric(quality)) ~ I(alcohol) + log(volatile.acidity) + 
##     log(bound.sulfur.dioxide) + log(chlorides) - 1, data = wine)
## m5: lm(formula = I(as.numeric(quality)) ~ I(alcohol) + log(volatile.acidity) + 
##     log(bound.sulfur.dioxide) + log(chlorides) + cat - 1, data = wine)
## 
## ============================================================================
##                               m1        m2        m3        m4        m5    
## ----------------------------------------------------------------------------
## (Intercept)                 0.582***                                        
##                            (0.098)                                          
## I(alcohol)                  0.313***  0.299***  0.304***  0.289***  0.312***
##                            (0.009)   (0.004)   (0.007)   (0.010)   (0.011)  
## log(volatile.acidity)                -0.549*** -0.559*** -0.546*** -0.590***
##                                      (0.029)   (0.031)   (0.031)   (0.034)  
## log(bound.sulfur.dioxide)                      -0.015    -0.030     0.061   
##                                                (0.015)   (0.017)   (0.038)  
## log(chlorides)                                           -0.080*   -0.176***
##                                                          (0.037)   (0.050)  
## cat: Normal                                                        -1.017***
##                                                                    (0.286)  
## cat: HighAcidic                                                    -1.241***
##                                                                    (0.314)  
## cat: HighChlorides                                                 -1.012***
##                                                                    (0.271)  
## cat: LowSulfur                                                     -1.630***
##                                                                    (0.275)  
## ----------------------------------------------------------------------------
## R-squared                      0.190     0.962     0.962     0.962     0.963
## adj. R-squared                 0.190     0.962     0.962     0.962     0.963
## sigma                          0.797     0.772     0.772     0.772     0.767
## F                           1146.395 62528.319 41685.184 31289.003 15869.679
## p                              0.000     0.000     0.000     0.000     0.000
## Log-likelihood             -5839.391 -5683.029 -5682.568 -5680.194 -5644.575
## Deviance                    3112.257  2919.758  2919.208  2916.380  2874.271
## AIC                        11684.782 11372.057 11373.136 11370.387 11307.151
## BIC                        11704.272 11391.547 11399.122 11402.870 11365.620
## N                           4898      4898      4898      4898      4898    
## ============================================================================

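The final model reports an R-squared of 0.963, which at first sight suggests that it accounts for 96.3% of the variance in quality. However, models m2 to m5 are fitted without an intercept (the "- 1" in the formula), and for such models R measures R-squared against zero rather than against the mean quality, so the value is not comparable with the 0.190 of m1. The residual standard error (sigma) only falls from 0.797 for m1 to 0.767 for m5, which indicates that the extra features improve the fit modestly.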
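When comparing R-squared values across these models, it helps to know that `summary.lm` uses a different baseline when the intercept is dropped: the total sum of squares is taken around zero instead of around the mean of the response. The sketch below fits the same made-up data both ways to show the size of the effect.

```r
# Made-up data: the response sits far from zero and depends only weakly on x.
x <- c(8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0)
y <- c(5, 6, 5, 6, 5, 6, 7, 6, 6, 7)

with_intercept <- lm(y ~ x)
no_intercept   <- lm(y ~ x - 1)

r2_with <- summary(with_intercept)$r.squared
r2_none <- summary(no_intercept)$r.squared

# Residual standard errors are on the same scale and can be compared directly.
c(r2_with    = r2_with,
  r2_none    = r2_none,
  sigma_with = summary(with_intercept)$sigma,
  sigma_none = summary(no_intercept)$sigma)
```

The no-intercept fit reports a much higher R-squared even though its residuals are no smaller, so sigma, AIC or BIC are safer yardsticks when some models have an intercept and others do not.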

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Since alcohol strength was identified previously as the variable most affecting quality, I set out to see if I could explain outliers in this relationship using the other interesting features that affected quality (volatile.acidity, chlorides, bound.sulfur.dioxide). Using these I was able to explain most of the outliers. The others were probably due to hidden features of the data or to data accuracy/quality.

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

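Yes, I created a linear model of quality, starting with alcohol and then adding the other variables I had previously discussed. The final model reports an R^2 value of 0.963, but because it is fitted without an intercept this figure is measured against zero rather than the mean quality, and so it overstates the fit. The residual standard error only improves from 0.797 for the alcohol-only model to 0.767, so the additional variables add only a little beyond alcohol.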

Final Plots and Summary

Plot One

This variety of grape is used to produce mainly dry or medium dry wines. There is only a small proportion of medium and sweet wines in the sample.

Plot Two

As the alcohol content in wine increases the quality increases. Quality is positively correlated with alcohol content (0.436).

Plot Three

Variance in the quality vs. alcohol relationship can be explained by the level of volatile.acidity being too high, the level of bound.sulfur.dioxide being too low, or a high level of chlorides.

Reflection

The white wine data set contains information on almost 4900 white wine samples of variants of the Portuguese “Vinho Verde” wine. I started out by trying to understand individual variables and their distributions in my sample. I then went on to explore which variables affected the quality of wine, began to work out what the main causes of variance in my dataset were, and created a linear model to estimate quality based on alcohol level, volatile.acidity, chlorides and bound.sulfur.dioxide. From my research I expected the acid to sweetness ratio to have more impact on the quality of wine, but in my data exploration this did not appear to be the case. The model would likely not cope with wines that have a strength higher than those in my sample, and may predict a value outside of the limit that the scoring system allows (10). I was also not able to fully classify the reason for outliers in the dataset according to the data provided. This may have been caused by many different things, for example hidden variables, testing bias or data quality/accuracy. I would have liked price to also be a variable in the dataset so that I could investigate whether price and quality were also correlated.
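The out-of-range prediction problem mentioned above can be illustrated, and one simple workaround applied, by clamping predictions to the 0 to 10 scale. This is a sketch on made-up data with a perfectly linear toy model, not the model fitted in this report.

```r
# Made-up training data in which quality rises by one point per 1% alcohol.
toy   <- data.frame(alcohol = c(9, 10, 11, 12, 13), quality = c(4, 5, 6, 7, 8))
model <- lm(quality ~ alcohol, data = toy)

# One wine inside the training range and one far stronger than any in it.
new_wines <- data.frame(alcohol = c(10.5, 16))
raw <- predict(model, newdata = new_wines)

# Clamp predictions to the limits of the scoring scale.
clamped <- pmin(pmax(raw, 0), 10)

rbind(raw, clamped)
```

Clamping keeps the output valid but does not make the extrapolation any more trustworthy.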